Management of Fault Tolerance Information for Coordinated Checkpointing Protocol without Sympathetic Rollbacks

Authors

  • Kwang-Sik Chung
  • Young-Jun Lee
  • Heon-Chang Yu
  • Won-Gyu Lee
Abstract

This paper presents the conditions for an extended global recovery line in coordinated checkpointing protocols, together with a new garbage collection protocol for checkpoints and message logs that avoids the sympathetic rollbacks caused by lost messages. Previous work on garbage collection in coordinated checkpointing protocols assumed that the communication channel never loses in-transit messages, and therefore deleted every checkpoint except the last one on each process. However, even when coordinated checkpointing is built on a reliable communication protocol such as TCP, in-transit messages are lost when a failure occurs, and these lost messages lead to sympathetic rollbacks of the faulty process or of related processes. Fault tolerance information must therefore be managed so that coordinated checkpoints and a light message log are stored and deleted in a way that avoids sympathetic rollback. In this paper, we define the extended global recovery line conditions for garbage collection of checkpoints and of the message logs kept for lost messages, and we present a new garbage collection algorithm that operates within the extended global recovery line. The proposed algorithm piggybacks process information on each message, so no additional messages are needed for garbage collection or for establishing the extended global recovery line. Because it relies on checkpoint information piggybacked on ordinary communication messages, the proposed algorithm is called the "lazy garbage collection algorithm".
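The idea of piggybacking checkpoint information so that message logs can be pruned without extra control messages can be illustrated with a minimal sketch. This is not the paper's algorithm: the names `Message`, `Process`, `covered_seq`, and the per-destination sequence numbering are all hypothetical, and the pruning rule (discard a logged message once the receiver's latest checkpoint covers it) is a simplifying assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    src: str
    dst: str
    seq: int          # per-(src, dst) sequence number (assumed scheme)
    payload: str
    covered_seq: int  # piggybacked: highest seq from dst that src's
                      # latest checkpoint already covers


@dataclass
class Process:
    name: str
    ckpt_index: int = 0
    next_seq: dict = field(default_factory=dict)  # dst -> last seq used
    send_log: dict = field(default_factory=dict)  # dst -> {seq: payload}
    received: dict = field(default_factory=dict)  # src -> highest seq seen
    covered: dict = field(default_factory=dict)   # src -> highest seq in checkpoint

    def send(self, dst: str, payload: str) -> Message:
        # Log the message on the sender side so it can be replayed if lost.
        seq = self.next_seq.get(dst, 0) + 1
        self.next_seq[dst] = seq
        self.send_log.setdefault(dst, {})[seq] = payload
        # Piggyback what our latest checkpoint covers from dst.
        return Message(self.name, dst, seq, payload, self.covered.get(dst, 0))

    def receive(self, msg: Message) -> None:
        self.received[msg.src] = max(self.received.get(msg.src, 0), msg.seq)
        # Lazy garbage collection: drop log entries the peer has checkpointed.
        log = self.send_log.get(msg.src, {})
        for seq in [s for s in log if s <= msg.covered_seq]:
            del log[seq]

    def checkpoint(self) -> None:
        # Everything received so far is now covered by this checkpoint.
        self.ckpt_index += 1
        self.covered = dict(self.received)
```

In this sketch, pruning happens only when an ordinary application message arrives from the peer, which is what makes the collection "lazy": a process that never hears from a peer again simply keeps its log entries for that peer.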

Similar resources

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling is an important issue for large scale computational grid because of its unreliable nature of grid resources. Commonly exploited techniques to realize fault tolerance is periodic Checkpointing that periodically ...

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...

Coordinated Checkpointing Without Direct Coordination

Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. Long-running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper, we des...

Minimum Process Coordinated Checkpointing Scheme for Ad Hoc Networks

The wireless mobile ad hoc network (MANET) architecture is one consisting of a set of mobile hosts capable of communicating with each other without the assistance of base stations. This has made possible creating a mobile distributed computing environment and has also brought several new challenges in distributed protocol design. In this paper, we study a very fundamental problem, the fault tol...

Adaptive time-based coordinated checkpointing for cloud computing workflows

Cloud computing is a new benchmark towards enterprise application development that can facilitate the execution of workflows in business process management system. The workflow technology can manage the business processes efficiently satisfying the requirements of modern enterprises. Besides the scheduling, the fault tolerance is a very important issue in the workflow management. In this paper,...

Journal:
  • J. Inf. Sci. Eng.

Volume 20

Publication date: 2004